Poster: aronsson Date: Aug 7, 2010 8:24am
Forum: texts Subject: OCR feedback

The OCR provided by the Internet Archive has improved a lot lately, thank you! This speeds up proofreading quite a lot. I'm talking about books in Swedish and Danish that were either scanned at Toronto for IA or books scanned by Google that I uploaded as PDF with images. The improved OCR correctly recognizes words in old spelling, which is otherwise a huge problem in these languages.

So, how can I get involved in improving OCR quality even more? How is OCR handled at IA, and who would receive my feedback?

Poster: stbalbach Date: Aug 7, 2010 10:12am
Forum: texts Subject: Re: OCR feedback

In the past, they've recommended uploading the improved text as a new file/work.

One way would be to submit the improved OCR text to the Distributed Proofreaders Project, where it would then be submitted to Project Gutenberg, and then eventually find its way back to Internet Archive as a Project Gutenberg text.

Poster: aronsson Date: Aug 7, 2010 12:40pm
Forum: texts Subject: Re: OCR feedback

Are you saying that IA's OCR takes input from proofread texts from Project Gutenberg? Is there any documentation of how that loop works, and who's in charge of that?

Poster: stbalbach Date: Aug 7, 2010 1:27pm
Forum: texts Subject: Re: OCR feedback

Distributed Proofreaders downloads OCR texts from IA (among other places), manually cleans them up ("proofreading"), and submits them to Project Gutenberg, which then posts them on the PG website. Then someone (I don't know who) uploads the PG texts to the Internet Archive (among other places).

This post was modified by stbalbach on 2010-08-07 20:27:21